Split providers out of the main "airflow/" tree into a UV workspace project #42505

Merged (1 commit) on Oct 9, 2024

Member

@ashb ashb commented Sep 26, 2024

closes #42857

As discussed https://lists.apache.org/thread/dyv5jhvt65xs6l5o2byc2b67f4wlwf6r, this is the first part of the new layout.

This is only a partial split so far. It moves all the code and tests, but the provider release generation code hasn't moved yet, as it's a few weeks until the next batch and I want to try to keep this already huge PR as small as possible. Likewise, the creation of core/ will be a follow-on PR.

In addition to the straight file rename the other changes I had to make here are:

  • Some mypy/typing fixes.

    Mypy can be fragile about what it picks up when, so maybe some of those
    changes were caused by that. But the typing changes aren't large.

  • Improve typing in common.sql type stub

    Again, likely a mypy file oddity, but the types should be safe

  • Removed the check-providers-init-file-missing check

    This isn't needed now that airflow/providers shouldn't exist at all in the
    main tree.

  • Create a "dev.tests_common" package that contains helper files and common
    pytest fixtures

    Since the provider tests are no longer under tests/ they don't automatically
    share the fixtures from the parent tests/conftest.py, so they needed to be
    extracted.

    Ditto for tests.test_utils -- they can't be easily imported in provider
    tests anymore, so they are moved to a more explicit shared location.
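
The fixture-sharing point above can be illustrated with a small model of how pytest discovers conftest.py files. This is a simplified sketch, not pytest's actual implementation: it assumes pytest only picks up conftest.py from the test file's own directory and its ancestors, and the file paths used are hypothetical examples.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def visible_conftests(test_file: str, existing_conftests: set[str]) -> list[str]:
    """Return the conftest.py files that apply to ``test_file``.

    Simplified model: only the directory of the test file and its
    ancestor directories are searched, nothing else.
    """
    found = []
    for directory in PurePosixPath(test_file).parents:
        candidate = str(directory / "conftest.py")
        if candidate in existing_conftests:
            found.append(candidate)
    # Report the outermost directory first.
    return list(reversed(found))


# Hypothetical layout after the move: two separate test roots.
CONFTESTS = {"tests/conftest.py", "providers/tests/conftest.py"}

core = visible_conftests("tests/models/test_dag.py", CONFTESTS)
provider = visible_conftests("providers/tests/amazon/test_s3.py", CONFTESTS)
```

In this model the provider test never sees tests/conftest.py, which is why fixtures that both trees need have to live in a package that both can import.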

In future we should switch how the CI image is built to make better use of UV caching than our own approach, as that would remove a lot of custom code.



Member Author

ashb commented Sep 26, 2024

Oh, prepare provider packages is failing for the prod image build. I guess that's going to have to be in this PR then.

@ashb ashb force-pushed the uv-workspace-for-providers branch 10 times, most recently from 056127e to 2e952e5 Compare September 30, 2024 20:40
Member

potiuk commented Oct 1, 2024

Small comment (I am back from holidays and slowly going through the last two weeks of history).

> In future we should switch how the CI image is built to make better use of UV caching than our own approach as that would remove a lot of custom code.

I tried various things before (including uv caching and local cache mounts) and I could not find a better solution. Compared with all the other approaches I tried, our approach is far more efficient in size and time. It especially shines because you have a single remote cache layer in ghcr.io that is used even if dependencies change, and you do not have cache duplication.

There are two problems with using uv cache:

  1. Enabling the UV cache (I tested it with similar hopes when UV was released) makes our image about 4 GB instead of 2 GB (I can't remember exactly, but it was almost a 2x factor). It also does not work with docker layer caching. Keeping the UV cache is very time-efficient, but you pay a huge price in used space. Note that UV caches not only the packages it actually installs but also the packages it considers as candidates, which in case of significant backtracking might mean that you keep multiple versions of the same package in the cache. This adds up very quickly, and the uv cache becomes really huge.

The UV cache benefits from the fact that the same cache is potentially reused by multiple projects installed on the same machine. It is stored in the HOME dir and reused between all venvs on that machine. In docker it is nearly useless: it multiplies the space used, because both installed and cached files are kept, but the cached files are not reusable by any other venv (we have only one venv in the docker container). And since layers in a docker image cannot be removed, even if we delete the cache after the final installation step, the cache remains in the docker image layer in addition to the installed packages, significantly increasing the size of the final image.

Of course, you could use locally mounted cache volumes in docker to keep the UV cache so that it does not pollute the docker image, but that would only be useful after you have built your image locally once, and you would not be able to use the remotely cached layers that are currently used when you build the image.

Using the remote cache the way we do now is much nicer, because the cache is actually the same as most installed packages (with the exception of packages that were added or updated after the original cached layer was created). This has the nice property that every time a new patchlevel base image is released by python (every two-three weeks or so) the cache is rebuilt from scratch, and we get a new "fresh" main cache with the latest package versions.

  2. Another problem is that you cannot avoid having all the "source" dependency files copied to the image layer in order to run the uv pip install command.

Whatever that command produces as cache will be invalidated the moment the source dependency files (pyproject.toml, hatch_build.py, provider.yaml files, generated/provider_dependencies.json) change - so if you want to use the cache effectively, you need to build it BEFORE you copy any of those files into the new image. This is precisely what "our" caching does - it does not use any "local file" and does not copy them to the image, but uses a "https://github" URL to install the dependencies for the first time - which means that the layer will not be invalidated the next time pyproject.toml, hatch_build.py or any other file that determines dependencies changes.
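
The invalidation argument can be sketched with a toy model of layer caching. This is an illustration under stated assumptions, not how docker or the Airflow Dockerfile is implemented: each layer key is assumed to depend on the parent layer, the instruction text, and (for COPY) the copied file's content, and the instruction strings below are placeholders.

```python
import hashlib


def layer_keys(steps, file_contents):
    """Compute one cache key per build step.

    Each key hashes the parent key, the instruction text and, for COPY
    steps, the content of the copied file.
    """
    keys = []
    parent = ""
    for kind, arg in steps:
        payload = f"{parent}|{kind}|{arg}"
        if kind == "COPY":
            payload += "|" + file_contents[arg]
        parent = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(steps, before, after):
    """Count the leading layers whose cache key survives a file change."""
    count = 0
    for old, new in zip(layer_keys(steps, before), layer_keys(steps, after)):
        if old != new:
            break
        count += 1
    return count


# Placeholder instructions: copy local files first, then install.
COPY_FIRST = [("COPY", "pyproject.toml"), ("RUN", "install from local files")]
# Placeholder instructions: install from a remote URL before copying anything.
URL_FIRST = [
    ("RUN", "install from remote URL"),
    ("COPY", "pyproject.toml"),
    ("RUN", "install from local files"),
]

BEFORE = {"pyproject.toml": "dependencies = ['a==1']"}
AFTER = {"pyproject.toml": "dependencies = ['a==2']"}

copy_first_reused = reused_layers(COPY_FIRST, BEFORE, AFTER)
url_first_reused = reused_layers(URL_FIRST, BEFORE, AFTER)
```

In this model a change to pyproject.toml leaves no reusable layer when the file is copied first, while the layer installed from the remote URL stays valid because nothing local feeds into its key.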

Now - I'd love to make it simpler, but so far all the attempts to find a better solution failed because the one we have is the only one that has good "caching" properties.

Member

potiuk commented Oct 1, 2024

But maybe you can come up with a better approach that I have not thought about and tried already, of course :)

Member Author

ashb commented Oct 1, 2024

Yeah, that comment was mostly a note to myself, and not something I plan on even looking at as part of this work.

Member Author

ashb commented Oct 2, 2024

I think I'm getting closer now 🤞🏻 .

Oddly enough, a lot of the time the ER diagram static check is failing in CI, even though I've run breeze static-checks -a update-er-diagram and it says pass.

Edit: this turned out to be a (purposefully) untracked file in the migrations folder in my local checkout.

@ashb ashb force-pushed the uv-workspace-for-providers branch 2 times, most recently from 0956d95 to 36ab177 Compare October 8, 2024 10:58
utkarsharma2 pushed a commit that referenced this pull request Oct 24, 2024
…#43307)

While the #43260 attempted to address the problem where example
dag importability tests should skip provider tests on non-main,
it did not actually solve the problem.

While debugging it, it turned out that since #42505, the provider
tests were not executed in main "at all" - the "providers"
directory was not included in the list of places to check for
the example dags (they were in "airflow" in "v2-10-test"). This is
why it "looked like" the solution worked in "main".

This PR fixes both problems:

* brings back importability of provider's example_dags in main branch
* properly excludes the providers examples in non-main branch
@o-nikolas o-nikolas mentioned this pull request Oct 24, 2024
potiuk added a commit to potiuk/airflow that referenced this pull request Oct 27, 2024
Another teething problem after moving providers in apache#42505. After
moving providers, the history of the current folder in "providers"
only contains changes after the move - it does not include
changes from before the move - and since we always regenerate
the full list of commits - they were missing.

We cannot use `--follow` - because `git log --follow` only
works for single files, not directories, but since the move
was very predictable ("airflow/providers/nnn" ->
"airflow/providers/src/airflow/providers/nnn") we can add the old
path to `git log` command to get both - pre and post move commit
history.
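
A minimal sketch of the idea, assuming the command is assembled in Python: `git log` accepts several pathspecs after `--`, so passing both the new and the old location yields commits from before and after the move. The function name, the format string, and the example paths are illustrative assumptions, not the actual release tooling.

```python
def git_log_command(new_path, old_path, fmt="%H %s"):
    """Build a ``git log`` invocation covering both locations of a folder.

    Everything after ``--`` is treated as a pathspec, and commits touching
    any of the listed paths are included in the output.
    """
    return ["git", "log", f"--pretty=format:{fmt}", "--", new_path, old_path]


# Hypothetical paths for a provider named "nnn".
command = git_log_command(
    new_path="providers/src/airflow/providers/nnn",
    old_path="airflow/providers/nnn",
)
```

The resulting list can be passed to `subprocess.run(command, capture_output=True, text=True)` from inside a checkout of the repository.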
potiuk added a commit that referenced this pull request Oct 27, 2024
potiuk added a commit to potiuk/airflow that referenced this pull request Oct 29, 2024
After apache#42505, you need to jump through extra hoops to develop
providers in Airflow's monorepo.

This is simple (not yet uv-specific) documentation on how to
install providers in editable mode when you want to develop
providers, so that you can run unit tests.

Copied mostly from apache#43082

It's not yet a full set of docs explaining how to use workspaces
and UV. This should be handled via apache#43200
kaxil added a commit that referenced this pull request Oct 29, 2024
…3468)

* Add simple instructions for installing providers in editable mode

* Apply suggestions from code review

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>

---------

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
Member

potiuk commented Oct 29, 2024

> breeze testing tests provider/tests/x.py currently fails. (Temp workaround: breeze shell, then run pytest provider/tests/x.py directly. The bug is that the testing tests command includes tests in the pytest command even when another file is specified, leading to pytest loading tests/conftest.py and providers/tests/conftest.py, which it doesn't like as they are both "top level".)

This was never supposed to work. The parameters that you can pass to "testing tests" are really "extra pytest parameters" - the command is not really intended to be given "pytest files" there. That's why, for example, the extra parameters are not autocompletable with files/paths. The "tests" command is really meant to run the whole "test type", and you can easily use it now.

For example running all airbyte tests can be achieved by:

breeze testing tests --test-type "Providers[airbyte]"

And it nicely works now.

Member

potiuk commented Oct 29, 2024

(So I think all known "teething problems" are indeed solved - and I can attempt to split out individual providers.)

potiuk added a commit to potiuk/airflow that referenced this pull request Oct 30, 2024
When providers were moved in apache#42505, passing parameters broke
when provider tests were passed as extra args of the "testing tests"
command of breeze.

Previously all tests were under "tests" folder and there was
an exclusion to disable "All" tests when any test was passed as
parameter. But after moving tests to "providers" this stopped working.

Additional exclusion needs to be added for "providers/tests" and
"providers/tests_sdk/".

This PR also adds autocompletion for tests passed this way by
setting the click type to Path for the extra args (but without
the need for the Path to exist).

Also, during this check it turned out that "All" tests are not
working in the intended way - but this should not impact our CI,
only local runs.

Appropriate comment has been added and it's captured in
apache#42632
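
The exclusion described in this commit message can be sketched as follows. This is a hypothetical illustration of the logic, not breeze's actual code: the function names and the default target are assumptions, and only the folder names come from the message above.

```python
from __future__ import annotations

# Folders whose paths count as "a test was passed explicitly".
OLD_ROOTS = ("tests",)
NEW_ROOTS = ("tests", "providers/tests", "providers/tests_sdk")


def is_test_path(arg: str, roots: tuple[str, ...]) -> bool:
    """Return True if ``arg`` is one of the test roots or lies inside one."""
    normalized = arg.rstrip("/")
    return any(
        normalized == root or normalized.startswith(root + "/") for root in roots
    )


def pytest_args(extra_args: list[str], roots: tuple[str, ...]) -> list[str]:
    """Prepend the default "tests" target unless a test path was passed."""
    if any(is_test_path(arg, roots) for arg in extra_args):
        return list(extra_args)
    return ["tests", *extra_args]


provider_test = ["providers/tests/airbyte/test_x.py"]
with_old_roots = pytest_args(provider_test, OLD_ROOTS)
with_new_roots = pytest_args(provider_test, NEW_ROOTS)
```

With only the old root, the provider file is not recognised as a test path, so the default "tests" folder is added next to it. That combination is what made pytest load two top-level conftest.py files.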
potiuk added a commit to potiuk/airflow that referenced this pull request Oct 31, 2024
potiuk added a commit that referenced this pull request Oct 31, 2024
potiuk added a commit to potiuk/airflow that referenced this pull request Nov 3, 2024
The compatibility tests in CI are using providers built as packages
from sources, so the compatibility tests run there using
"providers/tests" work just fine, because all providers are installed
in the airflow.providers site library. However, when we are iterating
on and debugging backwards compatibility provider tests, we should be
able to use local provider sources rather than installed packages,
and we have the possibility of mounting both - providers sources
and tests - to the image.

See `contributing-docs/testing/unit_tests.rst` on how to do it
by using ``--mount-sources providers-and-tests`` flag connected
with `--use-airflow-version`.

However, as of apache#42505 this has been broken, because currently in
main we rely on airflow having a "pkgutil" namespace package for
both the airflow and airflow.providers packages (previous airflow
versions had an implicit package for airflow.providers) - so
providers installed locally cannot be used as "another" source of
providers. Previously it worked because both the "installed" and the
"sources" `airflow.providers` packages were implicit namespace packages.

As explained in https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages

> Every distribution that uses the namespace package must include such
> an `__init__.py`. If any distribution does not, it will cause the
> namespace logic to fail and the other sub-packages will not be
> importable. Any additional code in __init__.py will be inaccessible.

So because old airflow uses implicit provider packages and
main airflow from source uses an "explicit" provider package,
the only way we can make the "source" providers work is to mount
them or symbolically link them inside the installed distribution
of the airflow package (in the site directory), or to dynamically
remove the __init__.py from the provider's source directory.

We cannot mount the provider package sources inside the
installed airflow - because when --use-airflow-version is used,
airflow is installed dynamically inside the container - after
the container is started.

This PR solves the problem by adding an env variable that
makes the initialization script remove the installed
airflow.providers folder after installing airflow and link the
"providers/src/airflow/providers" folder there. This has the
added benefit that all providers (including the preinstalled ones)
are used from "main" sources rather than from installed packages -
which was problematic for the past way of using providers from
sources - which used the fact that both "airflow.providers" in
the site-library and the one in sources were implicit namespace
packages.
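
For readers unfamiliar with "pkgutil"-style namespace packages, here is a minimal, self-contained sketch of the mechanism. It builds two throwaway distributions in a temporary directory, each shipping the same `__init__.py` with the `extend_path` line, and imports a module from each. The package name `demo_ns` and the module names are made up for the demonstration; this shows only the case where every distribution is explicit, not the mixed implicit/explicit situation described above.

```python
from __future__ import annotations

import importlib
import sys
import tempfile
from pathlib import Path

# The line every distribution of a pkgutil-style namespace package ships.
EXTEND_PATH = '__path__ = __import__("pkgutil").extend_path(__path__, __name__)\n'


def make_distribution(root: Path, dist: str, module: str) -> Path:
    """Create ``root/dist/demo_ns/`` with ``__init__.py`` and one module."""
    package = root / dist / "demo_ns"
    package.mkdir(parents=True)
    (package / "__init__.py").write_text(EXTEND_PATH)
    (package / f"{module}.py").write_text(f"NAME = {module!r}\n")
    return root / dist


def import_from_both() -> tuple[str, str]:
    """Import one module from each distribution of the shared namespace."""
    with tempfile.TemporaryDirectory() as tmp:
        dists = [
            make_distribution(Path(tmp), "dist_one", "alpha"),
            make_distribution(Path(tmp), "dist_two", "beta"),
        ]
        sys.path[:0] = [str(d) for d in dists]
        importlib.invalidate_caches()
        try:
            alpha = importlib.import_module("demo_ns.alpha")
            beta = importlib.import_module("demo_ns.beta")
            return alpha.NAME, beta.NAME
        finally:
            # Leave sys.path as we found it.
            for d in dists:
                sys.path.remove(str(d))


names = import_from_both()
```

Both modules are importable because `extend_path` adds every `demo_ns` directory found on `sys.path` to the package's `__path__`.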
potiuk added a commit that referenced this pull request Nov 4, 2024
…ws (#43617)

* Enable back iterative development of latest providers with old airflows

* Update Dockerfile.ci

Co-authored-by: GPK <gopidesupavan@gmail.com>

* Update scripts/docker/entrypoint_ci.sh

Co-authored-by: GPK <gopidesupavan@gmail.com>

---------

Co-authored-by: GPK <gopidesupavan@gmail.com>
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
This is a no-op change right now, but as part of the provider re-org in apache#42505
this sets us up to be able to load the providers code in the tests

The reason this change is done separately is that changes to breeze code from
forks don't take effect, and this small change makes it easier to land on
main without having to re-create that large PR.
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…roject (apache#42505)

This is only a partial split so far. It moves all the code and tests, but
leaves the creation of `core/` to a separate PR as this is already large
enough.

In addition to the straight file rename the other changes I had to make here
are:

- Some mypy/typing fixes.

  Mypy can be fragile about what it picks up when, so maybe some of those
  changes were caused by that. But the typing changes aren't large.

- Improve typing in common.sql type stub

  Again, likely a mypy file oddity, but the types should be safe

- Removed the `check-providers-init-file-missing` check

  This isn't needed now that airflow/providers shouldn't exist at all in the
  main tree.

- Create a "dev.tests_common" package that contains helper files and common
  pytest fixtures

  Since the provider tests are no longer under tests/ they don't automatically
  share the fixtures from the parent `tests/conftest.py`, so they needed to be
  extracted.

  Ditto for `tests.test_utils` -- they can't be easily imported in provider
  tests anymore, so they are moved to a more explicit shared location.

In future we should switch how the CI image is built to make better use of UV
caching than our own approach, as that would remove a lot of custom code.
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…roject (apache#42505) (apache#42624)

Co-authored-by: Ash Berlin-Taylor <ash@apache.org>
Co-authored-by: Ryan Hatter <25823361+RNHTTR@users.noreply.github.com>
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
* Move tests common without changes

* Fix docstrings in tests_common

* Move tests_common from "dev" to top-level.

Follow-up after apache#42505 fixing teething problem with tests_common.

Originally in apache#42505 common test code was moved to "dev" folder, but
the "dev" folder is really dedicated to "build" scripts and the
problem with moving "tests_common" to the folder was that the
whole "dev" folder is replaced (for non-committer PRs) with
the content from the target branch.

This is done for security reasons, because we can accidentally
use any of the scripts from dev in the CI build scripts and we
might not notice, which will open us to a security issue where
a file in "dev" coming from PR could be accidentally executed
during the "pull_request_target" workflow - which would expose
our secrets and GitHub Package write permissions to a
contributor coming from a fork.

This change moves the files, fixes pre-commit specification and
docs, also fixes a number of "doc" issues detected by "ruff" in
the tests_common folder as they were detected after the move.
The tests_common folder is added to folders mounted when
breeze is executed with local folders mounted (in order to
avoid accidental mounting of randomly generated files to
inside the breeze container).

All imports for the common tests were updated to reflect this
move.
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…ache#43173)

Cleans up airflow and providers `__init__.py` files in order to
get provider imports working again.

This is done by excluding the two `__init__.py` files from
automated ruff isort rules adding `from __future__ import annotations`.

That should finally get rid of the IntelliJ teething import
problem that was introduced in apache#42505.

There were earlier, unsuccessful attempts to fix it in
apache#43116 and apache#43081 that followed apache#42951 - but the key is that PyCharm
requires the namespace's extend_path to be the first "real" line
of code in the `__init__.py` to understand that the package
is an "explicit" namespace package - and that conflicts with
the requirement that "from __future__ import annotations" be
the first line of Python code.
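
The ordering conflict can be checked directly with Python's built-in `compile`. This is a small sketch that only tests whether two candidate `__init__.py` bodies are syntactically accepted; the helper name is an assumption, and it says nothing about how PyCharm itself parses the file.

```python
EXTEND_PATH = '__path__ = __import__("pkgutil").extend_path(__path__, __name__)\n'
FUTURE_IMPORT = "from __future__ import annotations\n"

# Ordering that an automated "add future import" rule would produce.
FUTURE_FIRST = FUTURE_IMPORT + EXTEND_PATH
# Ordering with extend_path as the first statement.
EXTEND_PATH_FIRST = EXTEND_PATH + FUTURE_IMPORT


def compiles(source: str) -> bool:
    """Return True if ``source`` is accepted by the Python compiler."""
    try:
        compile(source, "<demo __init__.py>", "exec")
    except SyntaxError:
        return False
    return True


future_first_ok = compiles(FUTURE_FIRST)
extend_path_first_ok = compiles(EXTEND_PATH_FIRST)
```

Python rejects a `from __future__` import that follows other code, so a file that puts `extend_path` first cannot also carry the future import. That fits with excluding the two `__init__.py` files from the rule that adds it.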

Also this PR fixes following problem:

* pytest_plugin expecting .asf.yml in "airflow" sources - even during
  compatibility tests with older version of airflow (where the
  .asf.yml is not present)

---------

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…43273)

ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…ache#43468)

* Add simple instructions for installing providers in editable mode

* Apply suggestions from code review

Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>

---------

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…43529)

ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…ws (apache#43617)

* Enable back iterative development of latest providers with old airflows

* Update Dockerfile.ci

Co-authored-by: GPK <gopidesupavan@gmail.com>

* Update scripts/docker/entrypoint_ci.sh

Co-authored-by: GPK <gopidesupavan@gmail.com>

---------

Co-authored-by: GPK <gopidesupavan@gmail.com>